Downstream Analysis of Synteny Networks with MSA2dist

Kristian K Ullrich

Max Planck Institute for Evolutionary Biology

Background - DNA Evolution

A matter of distance.

Background - Coding Sequences

Synonymous and NonSynonymous Substitutions: nucleotide substitutions might lead to changes on amino acid level.

Typically used to:

  1. Detect and Date whole-genome duplication (WGD) events;
  2. Predict selective pressure acting on a protein;
  3. Predict selective pressure acting on a codon;
  4. Investigate codon usage (R/Bioconductor package coRdon).

MSA2dist

An R/Bioconductor package to calculate pairwise distances between all sequences of a MultipleSequenceAlignment (DNAStringSet or AAStringSet) and to conduct codon based analysis.

MSA2dist Features:

  1. Calculates pairwise AA, DNA and IUPAC distances;
  2. Calculates Synonymous and NonSynonymous Substitutions (dN/dS) KaKs_Calcultor 2.0 models (C++ ported to R with Rcpp);
  3. Use pre-calculated MSA or conduct all possible pairwise alignments prior dN/dS calculation;
  4. Calculate average behavior of each codon from MSA.

MSA2dist workflow

Installation

From Bioconductor:

BiocManager::install(version='devel')
BiocManager::install("MSA2dist")


From GitHub:

remotes::install_github("kullrich/MSA2dist")

Example data set

Data: Whole-genome coding sequences and gene ranges for Chlorophyta algae + Physcomitrium patens as an outgroup.

Source: Pico-PLAZA 3.0

Where to find: the data/ directory in the GitHub repo associated with this presentation.

data
├── clusters.rda
├── codingsequences.rda
└── data_acquisition_cds.md
  1. codingsequences.rda: DNAStringSet object.
  2. clusters.rda: Pre-calculated syntenet clusters.

Example data set: codingsequences

# Load necessary libraries
library(MSA2dist)
library(Biostrings)
library(dplyr)
library(stringr)

# Load data set codingsequences
load(here::here("data", "codingsequences.rda"))

# Inspect data
head(names(codingsequences))
[1] "AP00G00010" "AP00G00020" "AP00G00030" "AP00G00040" "AP00G00050"
[6] "AP00G00060"

Example data set: clusters

# Load data set clusters
load(here::here("data", "clusters.rda"))

# Inspect data
head(clusters)
                   Gene Cluster
1 Chlo_CNC64A_028G00030       1
2 Chlo_CNC64A_028G00040       2
3 Chlo_CNC64A_028G00070       3
4 Chlo_CNC64A_028G00080       4
5 Chlo_CNC64A_028G00110       5
6 Chlo_CNC64A_028G00140       6
length(table(clusters$Cluster))
[1] 22605

Fetch coding sequences for a specific cluster

# Get size distribution of clusters
head(table(clusters$Cluster))

 1  2  3  4  5  6 
38 38 37  2  9  4 
# Get one random cluster of size 10
my_cluster <- clusters %>% dplyr::filter(
    Cluster==sample(which(table(clusters$Cluster)==10), 1))
head(my_cluster)
                      Gene Cluster
1 Bpra_BPRRCC1105_14G02260   15527
2 Bpra_BPRRCC1105_14G02990   15527
3     Osp_ORCC809_02G02140   15527
4     Osp_ORCC809_02G01300   15527
5          Oluc_OL02G04530   15527
6          Oluc_OL02G02820   15527

Fetch coding sequences for a specific cluster

# Remove species unique identifier and add as GeneID
my_cluster$GeneID <- stringr::str_split_fixed(my_cluster$Gene, "_", 2)[, 2]
my_cluster
                       Gene Cluster              GeneID
1  Bpra_BPRRCC1105_14G02260   15527 BPRRCC1105_14G02260
2  Bpra_BPRRCC1105_14G02990   15527 BPRRCC1105_14G02990
3      Osp_ORCC809_02G02140   15527    ORCC809_02G02140
4      Osp_ORCC809_02G01300   15527    ORCC809_02G01300
5           Oluc_OL02G04530   15527          OL02G04530
6           Oluc_OL02G02820   15527          OL02G02820
7          Omed_OM_04G02150   15527         OM_04G02150
8          Omed_OM_04G01300   15527         OM_04G01300
9          Otau_OT_02G00670   15527         OT_02G00670
10         Otau_OT_02G01820   15527         OT_02G01820
# Fetch coding sequences
my_cds <- codingsequences[match(my_cluster$GeneID, 
    names(codingsequences))]
my_cds
DNAStringSet object of length 10:
     width seq                                              names               
 [1]  2151 ATGAGCATAGAAACCAAGCAACG...CGATATCATATATTCGGTGTGA BPRRCC1105_14G02260
 [2]  2130 ATGGGCGTACCCGATCCCGATTT...TAAGAATATAACTTTTGTTTAG BPRRCC1105_14G02990
 [3]  2124 ATGTCCACCGATGAGTTGGACAT...TCACCAAATCCATAGAATTTAG ORCC809_02G02140
 [4]  1950 ATGCTCTATGGTGATATTCTACG...TAGTATTACGTTCGTTAGCTAG ORCC809_02G01300
 [5]  2154 ATGGCCGATGAAGCATACACGCG...CAACATAACGTTCGTAGACTAG OL02G04530
 [6]  2124 ATGGAACGCGAGTTCGAAGCCTT...TGGAATTGTGCACATCATTTGA OL02G02820
 [7]  2139 ATGTTTGAACCAGATTATCGAAG...AAACATCACGTTCGTAGAATAG OM_04G02150
 [8]  2133 ATGAACAACGACGGCGTCGACGG...GGGTCACATCCACTGTCTCTAG OM_04G01300
 [9]  2124 ATGGAGCGCGAGTTCGTGAATTT...AGGCACGGTGCACATTATTTAG OT_02G00670
[10]  2133 ATGAGTGAAGAGGTGTACGCTCA...TAATATTACATTCATAGAGTAA OT_02G01820

Coding sequence alignment

To calculate a coding sequence alignment for two sequences:

# Get coding sequence alignment
cds1_cds2_aln <- cds2codonaln(cds1 = my_cds[1], cds2 = my_cds[2], remove.gaps = FALSE)
cds1_cds2_aln
DNAStringSet object of length 2:
    width seq                                               names               
[1]  2367 ATGAGCATAGAAACCAAGCAACG...ATATC---ATATATTCGGTGTGA BPRRCC1105_14G02260
[2]  2367 ATGGGCGTACCCGATCCC-----...ATAAGAATATAACTTTTGTTTAG BPRRCC1105_14G02990

To calculate dN/dS from this alignment:

# Calculate dN/dS for this alignment; model = Li
cds1_cds2_kaks <- dnastring2kaks(cds1_cds2_aln,
    model = "Li", threads = 1, isMSA = TRUE)
cds1_cds2_kaks
  Comp1 Comp2                seq1                seq2        ka       ks
1     1     2 BPRRCC1105_14G02260 BPRRCC1105_14G02990 0.7036953 9.999999
          vka      vks
1 0.002546956 9.999999

Parallelized pairwise coding sequence alignments

To calculate all pairwise combinations:

start_time <- Sys.time()
# Calculate dN/dS for the whole cluster; model = YN
my_cds_kaks <- dnastring2kaks(my_cds,
    model = "YN", threads = 2, isMSA = FALSE)
end_time <- Sys.time()

head(my_cds_kaks)
         Comp1 Comp2                seq1                seq2 Method       Ka
result.1     1     2 BPRRCC1105_14G02260 BPRRCC1105_14G02990     YN  0.68691
result.2     1     3 BPRRCC1105_14G02260    ORCC809_02G02140     YN 0.385174
result.3     1     4 BPRRCC1105_14G02260    ORCC809_02G01300     YN 0.714543
result.4     1     5 BPRRCC1105_14G02260          OL02G04530     YN 0.734026
result.5     1     6 BPRRCC1105_14G02260          OL02G02820     YN 0.370458
result.6     1     7 BPRRCC1105_14G02260         OM_04G02150     YN 0.670571
              Ks     Ka/Ks P-Value(Fisher) Length S-Sites N-Sites
result.1 4.48181  0.153266     2.47864e-58   1911 393.764 1517.24
result.2 4.64889 0.0828529    6.04249e-105   2052 492.021 1559.98
result.3 4.54757  0.157126     7.84145e-49   1887 429.845 1457.15
result.4  4.5871   0.16002     1.32653e-43   1941 453.107 1487.89
result.5 4.59787 0.0805717    1.22049e-143   2010 459.662 1550.34
result.6 4.56461  0.146907     3.77102e-72   1896 439.721 1456.28
         Fold-Sites(0:2:4) Substitutions S-Substitutions N-Substitutions
result.1                NA          1026         345.973         680.027
result.2                NA           885         415.871         469.129
result.3                NA          1032         362.557         669.443
result.4                NA          1067         372.541         694.459
result.5                NA           880          427.41          452.59
result.6                NA          1039         395.111         643.889
         Fold-S-Substitutions(0:2:4) Fold-N-Substitutions(0:2:4)
result.1                          NA                          NA
result.2                          NA                          NA
result.3                          NA                          NA
result.4                          NA                          NA
result.5                          NA                          NA
result.6                          NA                          NA
         Divergence-Time Substitution-Rate-Ratio(rTC:rAG:rTA:rCG:rTG:rCA/rCA)
result.1         1.46886                              1.26809:1.26809:1:1:1:1
result.2         1.40751                                1.5774:1.5774:1:1:1:1
result.3         1.58768                              1.29662:1.29662:1:1:1:1
result.4         1.63349                                1.2953:1.2953:1:1:1:1
result.5         1.33721                                1.3531:1.3531:1:1:1:1
result.6         1.57368                              1.32171:1.32171:1:1:1:1
                                    GC(1:2:3) ML-Score AICc Akaike-Weight Model
result.1 0.360794(0.455006:0.333333:0.294043)       NA   NA            NA    NA
result.2 0.408784(0.501351:0.372297:0.352703)       NA   NA            NA    NA
result.3  0.39507(0.489145:0.358209:0.337856)       NA   NA            NA    NA
result.4 0.396866(0.481576:0.361499:0.347522)       NA   NA            NA    NA
result.5  0.402299(0.48939:0.375332:0.342175)       NA   NA            NA    NA
result.6 0.380594(0.465496:0.343162:0.333124)       NA   NA            NA    NA

How long did it take?

end_time - start_time
Time difference of 6.824694 secs

Calculate average behavior of each codon

Here, a pre-calculated MSA is necessary:

# load example data
data(hiv, package="MSA2dist")

# calculate average behavior
hiv.xy <- hiv |> dnastring2codonmat() |> codonmat2xy()

Plot average behavior of each codon:

Here’s where you can find me: